Chapter 4 Results

4.1 Analysis of Player Heights

4.1.1 Overall Distribution of Male and Female Heights

We first look at the overall distribution of players’ heights for both male and female players. We expect the median height of the men to be larger than the median height of the women. For this analysis, we use only the most recent year of data available, 2022.

## [1] "Median Male Player Height: 181 cm"
## [1] "Median Female Player Height: 170 cm"

We can clearly see that the typical male player is taller than the typical female player: the median male player height is 181 cm, while the median female player height is 170 cm. We can also see that the heights of both the male and female players are approximately normally distributed.
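The two medians reported above come from grouping heights by gender and taking the median of each group. A minimal sketch of that computation follows, assuming a hypothetical list of (gender, height_cm) pairs; the sample rows are invented for illustration and are not taken from the dataset.

```python
from statistics import median

# Hypothetical sample rows: (gender, height_cm). Not the real dataset.
players = [
    ("male", 178), ("male", 181), ("male", 185),
    ("female", 166), ("female", 170), ("female", 174),
]

def median_height_by_gender(rows):
    """Group heights by gender and return the median of each group."""
    groups = {}
    for gender, height in rows:
        groups.setdefault(gender, []).append(height)
    return {gender: median(heights) for gender, heights in groups.items()}

medians = median_height_by_gender(players)
for gender, value in medians.items():
    print(f"Median {gender} player height: {value} cm")
```

The median is used rather than the mean because it is less sensitive to a few unusually tall or short players.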

4.1.2 Normality of Player Heights

Let’s create a Q-Q plot for the males and females to check our hypothesis that the heights of each gender are normally distributed.

We can see that both the female and male players’ heights are approximately normally distributed. The sample quantiles align very well with the theoretical quantiles of a normal distribution. We cannot run a Shapiro-Wilk test in this case, since common implementations (such as R’s shapiro.test) accept at most 5,000 observations and we have far more, but the approximate normality is clear from the Q-Q plot. This is not surprising, as heights in the general population are known to be approximately normally distributed within each sex.
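To make the idea behind the Q-Q plot concrete, the sketch below pairs each sorted sample value with the quantile a fitted normal distribution would predict, then summarises the alignment with a correlation coefficient. The heights are simulated (an assumption, not the real data), and the plotting positions (i - 0.5) / n are one common convention among several.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    ordered = sorted(sample)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i - 0.5) / n avoid the undefined 0 and 1 quantiles.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def pearson(pairs):
    """Pearson correlation between the two coordinates of the pairs."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Simulated heights stand in for the real data (hypothetical values).
rng = random.Random(0)
heights = [rng.gauss(181, 7) for _ in range(500)]

points = qq_points(heights)
r = pearson(points)
print(f"Q-Q correlation: {r:.4f}")
```

A correlation close to 1 means the points lie near a straight line, which is what the plot shows visually. This is a descriptive check, not a formal test of normality.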

4.1.3 Median Player Height per Position

Let’s get to the more interesting part of the analysis. We visualize the median heights of both female and male players per position to see whether height is more of an asset for some positions than for others. We again use the data from 2022 only. Since most players play more than one position, we include one row in our data for each position that a player plays. Thus, some (if not most) players are included in the height calculation for multiple positions.
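The "one row per position" step can be sketched as follows. The column name player_positions and its comma-separated format are assumptions made for illustration, as are the four sample players.

```python
from statistics import median

# Hypothetical players; "player_positions" is assumed to be comma-separated.
players = [
    {"name": "A", "height_cm": 190, "player_positions": "GK"},
    {"name": "B", "height_cm": 186, "player_positions": "CB, CDM"},
    {"name": "C", "height_cm": 172, "player_positions": "CAM, RW"},
    {"name": "D", "height_cm": 176, "player_positions": "RW, ST"},
]

def explode_positions(rows):
    """Yield one (position, height) pair per position a player plays."""
    for row in rows:
        for position in row["player_positions"].split(","):
            yield position.strip(), row["height_cm"]

def median_height_per_position(rows):
    """Median height per position, counting multi-position players repeatedly."""
    heights = {}
    for position, height in explode_positions(rows):
        heights.setdefault(position, []).append(height)
    return {position: median(values) for position, values in heights.items()}

result = median_height_per_position(players)
print(result)
```

Players C and D both contribute to the RW median here, which is exactly the double counting described above.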

The Cleveland Dot Plot above provides many interesting insights. First, we can confirm the above conclusion that the male players are taller than the female players. Further, we can see that this is true for every single position.

For both the males and the females, the tallest player on the pitch tends to be the goalkeeper. The goalkeeper is in charge of protecting the net, so it makes sense for a goalkeeper to be tall in order to “cover” as much of the goal as possible. We also see that the 2nd, 3rd, and 4th tallest positions for the men are the center back (CB), striker (ST), and central defensive midfielder (CDM). These positions are among the tallest for the women as well.

Besides the goalkeeper, the tallest players tend to be the defensive players (center back (CB), central defensive midfielder (CDM), left back (LB), left wing back (LWB), etc.). This agrees with what we know about football: defensive players must be able to win headers to protect the goal, and height helps with that. Strikers are among the only offensive players who are tall, for both the men and the women. Strikers tend to play in the box, meaning they must compete against the tall defensive players for headers.

The shortest positions on the pitch tend to be the midfielders and offensive wing players (CAM, LM, RM, RW, LF, CF, etc.). Again, this makes sense: these players play in the middle and on the edges of the pitch, and they need to be fast with the ball and able to deliver crosses into the box. Thus, height is not as crucial as speed for these positions.

4.1.4 Distribution of Player Height per Position

Now, let’s take a look at the distribution of player heights per position. With the Cleveland Dot Plot we can see the median player height, but a ridgeline plot will also allow us to identify the modality of heights for each position.

We can again see that goalkeepers and defensive players tend to be the tallest. For the male players, the heights for each position appear to be normally distributed. Interestingly, the heights for some of the female player positions look bimodal or close to uniform. For example, there seem to be two groups of players for the RB position, one taller and one shorter. For positions like CAM, RW, ST, and CM, the distribution seems almost uniform, with heights spread over a 15 to 20 cm range. It seems that height might not be as crucial an attribute in the women’s game as speed, for example. The women’s game is often played more on the ground than in the air, so speed and quickness could be more advantageous than height regardless of the position (except goalkeeper and some defensive positions).

4.2 Explaining Player Wages

In this part, we want to identify which features are the most important for predicting a player’s salary. The study focuses on the male players during the 2021-22 season, excluding the goalkeepers.

4.2.1 Explaining wages by a simple scatter plot

First, let’s get a first impression by plotting the wages as a function of the “overall” rating (the global level of a player) and the player’s age.

We can see from this graph that the wages grow with the overall rating of the player, which is to be expected. In contrast, age does not seem to have a strong influence on the wages. Now, let’s perform a linear regression to compare the influence of each feature.

4.2.2 Explaining wages by Linear Regression

We try to explain the wages from the data available in the other columns through a linear regression. Once the model is fitted, we keep only the 30 most significant coefficients (those with the lowest p-values) and plot them in a Cleveland dot plot to compare them.
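The selection step described above amounts to sorting the fitted terms by p-value and keeping the first 30. The sketch below assumes the regression output is already available as a list of records; the term names, estimates and p-values are hypothetical, and dropping the intercept is a choice made here for illustration rather than something stated in the analysis.

```python
def most_significant(terms, k=30):
    """Return the k terms with the smallest p-values, most significant first."""
    # Assumption: the intercept is not of interest and is excluded.
    candidates = [t for t in terms if t["term"] != "(Intercept)"]
    return sorted(candidates, key=lambda t: t["p_value"])[:k]

# Hypothetical regression output: names and numbers are illustrative only.
fitted_terms = [
    {"term": "(Intercept)", "estimate": -12000.0, "p_value": 1e-30},
    {"term": "overall", "estimate": 950.0, "p_value": 2e-16},
    {"term": "age", "estimate": -40.0, "p_value": 0.31},
    {"term": "international_reputation", "estimate": 8000.0, "p_value": 4e-9},
]

top = most_significant(fitted_terms, k=2)
print([t["term"] for t in top])
```

Note that a categorical column such as a club or league name contributes one coefficient per category, which is why a single feature can occupy many of the 30 slots.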

4.2.2.1 Regression 1

We can see that the club name is clearly the best indicator of a player’s salary. Indeed, clubs like PSG or Manchester City are well known to be extremely rich, because they benefit from state-backed ownership (from Qatar and the United Arab Emirates, respectively). They are therefore able to pay their players a lot, and this seems to have a much greater influence than the actual level of the player. Having made this observation, let’s remove the feature “club_name” to observe the influence of the others.

4.2.2.2 Regression 2

Now, we can observe that the features with the highest influence come from “league_name”, that is to say the league (and hence the country) the player plays in. Indeed, the English Premier League (like the other big European championships) is famous for earning huge amounts of money from broadcasting rights, which lets it pay its players very high wages. Having made this observation, let’s remove the feature “league_name” to observe the influence of the others.

4.2.2.3 Regression 3

Here, we can notice that the position of the player becomes the most important feature. For instance, Center Forwards (CF) are much better paid than Right Defensive Midfielders (RDM), probably because the position is more “spectacular” and television-friendly. We can also note that the international reputation is a very important feature for explaining the salary. This can be explained by the fact that football is an open, competitive market: if a player with a good reputation is not well paid, he is likely to get offers from other clubs and to change clubs during the summer. Having made this observation, let’s remove the feature “club_position” to observe the influence of the others.

4.2.2.4 Regression 4

4.3 Explaining Player Values

Now, we want to identify which features are the most important for predicting a player’s value in the transfer market.

4.3.1 Explaining values by a simple scatter plot

First, let’s get a first impression by plotting the values as a function of the “overall” rating and the player’s age.

We can see from this graph that the values grow with the overall rating of the player, which is to be expected (the more gifted a player, the higher his value). But unlike in the wages study, here the age seems to have a strong influence on the values! Indeed, for the same overall rating, younger players clearly have a higher value.

Now, let’s perform a linear regression to compare the influence of each feature.

4.3.2 Explaining values by Linear Regression

We try to explain the values from the data available in the other columns through a linear regression. Once the model is fitted, we keep only the 30 most significant coefficients (those with the lowest p-values) and plot them in a Cleveland dot plot to compare them.

4.3.2.1 Regression 1

As in the wages study, the club name is the best way to estimate the value of a player. Indeed, it is known that the most prestigious clubs are able to attract the most valuable players. Having made this observation, let’s remove the feature “club_name” to observe the influence of the others.

4.3.2.2 Regression 2

Now, as before, we can observe that the features with the highest influence come from “league_name”, that is to say the league (and hence the country) the player plays in. It is interesting to notice that the first league is not the English Premier League but the Indian Super League, which is not a prestigious championship! One possible explanation is that this league is a pool of young talents with high value who leave the country once they reach a certain fame. Having made this observation, let’s remove the feature “league_name” to observe the influence of the others.

4.3.2.3 Regression 3

Here, we can notice that the position of the player becomes the most important feature. Although Center Forwards may be the highest-paid position, they are not the most valuable: Left Wingers (LW) are more valuable. Having made this observation, let’s remove the feature “club_position” to observe the influence of the others.

4.3.2.4 Regression 4

Finally, we are left with only the pure skills and personality of the player. Once again, the international reputation is important, but we can now observe that ratings like “overall” or “movement_balance” are also good indicators. Finally, age seems to have a strong negative effect on value, as we explained earlier.

4.4 Clustering the players

In this part, let’s try to cluster the players into 3 groups, visualize them in the 2D plane spanned by the first 2 principal components of the PCA, and draw some conclusions.

We can see that the clusters are very well separated in the plane of the first two principal components (which explain 70% of the variance). By hovering over a few points, we find that the well-known players are in the leftmost cluster (number 3). The left-to-right vector (pca_1) probably conveys a notion of “fame” or “talent”. Let’s have a look at the centroids of the clusters to be sure.

##    overall potential value_eur      age height_cm weight_kg league_level
## 1 71.01614  73.90673   6040974 26.87958  179.2346  73.56735     1.252483
## 2 62.77612  69.20643   1082317 24.37759  183.1085  76.13407     1.413590
## 3 62.90256  70.06386   1107670 23.52504  178.9592  72.58113     1.425854
##   club_jersey_number weak_foot skill_moves international_reputation     pace
## 1           18.25760  3.141837    2.851024                 1.227188 70.67287
## 2           20.34325  2.766575    2.040771                 1.019835 62.00184
## 3           24.54328  3.072593    2.613936                 1.009317 71.77504
##   shooting  passing dribbling defending   physic attacking_crossing
## 1 60.40813 66.40705  69.82464  59.37058 68.82651           63.92070
## 2 36.26887 49.10560  52.83104  61.05014 66.67493           44.81855
## 3 59.19216 54.56347  63.74495  32.19449 57.82395           51.34666
##   attacking_finishing attacking_heading_accuracy attacking_short_passing
## 1            57.56487                   58.45531                70.41620
## 2            32.04371                   58.10946                57.28007
## 3            60.30221                   52.31619                59.16324
##   attacking_volleys skill_dribbling skill_curve skill_fk_accuracy
## 1          53.50450        69.05897    62.11453          55.63206
## 2          31.90762        48.93462    37.94141          34.63306
## 3          52.63645        63.50776    52.04794          45.18925
##   skill_long_passing skill_ball_control movement_acceleration
## 1           65.98309           70.62135              70.97408
## 2           51.36529           54.87567              61.48577
## 3           49.77155           63.25000              71.84278
##   movement_sprint_speed movement_agility movement_reactions movement_balance
## 1              70.41077         72.02173           68.24069         70.88532
## 2              62.40459         57.24573           58.13095         60.05326
## 3              71.68828         69.64849           57.83773         69.00738
##   power_shot_power power_jumping power_stamina power_strength power_long_shots
## 1         67.41868      67.67381      74.27250       67.25574         61.43808
## 2         46.66079      67.84224      64.81708       69.16951         34.38329
## 3         62.07104      61.41731      61.19818       59.89422         55.54484
##   mentality_aggression mentality_interceptions mentality_positioning
## 1             66.08271                59.73588              64.07604
## 2             62.37741                60.37117              40.06814
## 3             47.39616                28.46914              60.52562
##   mentality_vision mentality_penalties mentality_composure
## 1         65.83271            56.53352            67.59885
## 2         43.68007            40.46905            53.73866
## 3         56.17042            56.81464            57.17857
##   defending_marking_awareness defending_standing_tackle
## 1                    58.67489                  60.60475
## 2                    60.22608                  63.13664
## 3                    30.65450                  30.51184
##   defending_sliding_tackle       ls       st       rs       lw       lf
## 1                 57.37322 64.79469 64.79469 64.79469 67.09590 66.74441
## 2                 61.10211 48.59229 48.59229 48.59229 49.34307 48.76492
## 3                 28.78106 60.69332 60.69332 60.69332 61.50543 61.45264
##         cf       rf       rw      lam      cam      ram       lm      lcm
## 1 66.74441 66.74441 67.09590 67.45034 67.45034 67.45034 67.84171 67.23681
## 2 48.76492 48.76492 49.34307 49.61892 49.61892 49.61892 51.37668 52.34986
## 3 61.45264 61.45264 61.50543 60.68207 60.68207 60.68207 60.69604 55.45419
##         cm      rcm       rm      lwb      ldm      cdm      rdm      rwb
## 1 67.23681 67.23681 67.84171 65.51366 65.15720 65.15720 65.15720 65.51366
## 2 52.34986 52.34986 51.37668 58.02681 58.74490 58.74490 58.74490 58.02681
## 3 55.45419 55.45419 60.69604 48.77892 46.17139 46.17139 46.17139 48.77892
##         lb      lcb       cb      rcb       rb
## 1 64.39463 62.41962 62.41962 62.41962 64.39463
## 2 59.04408 61.46428 61.46428 61.46428 59.04408
## 3 46.67023 42.25214 42.25214 42.25214 46.67023
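Each row of the centroid table is the column-wise mean of the players assigned to that cluster. A minimal sketch of that computation is given below, assuming the cluster labels are already known; the two-feature rows (overall, height_cm) and the labels are hypothetical.

```python
def centroids(rows, labels):
    """Return the per-cluster mean of every feature column."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [total / counts[label] for total in acc]
        for label, acc in sums.items()
    }

# Hypothetical players described by (overall, height_cm) and their cluster.
rows = [(70, 180), (72, 178), (62, 184), (64, 182)]
labels = [1, 1, 2, 2]

centers = centroids(rows, labels)
print(centers)
```

Comparing centroids feature by feature, as done in the following paragraph, is a quick way to see which attributes separate the clusters.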

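The third cluster (blue) has a higher “overall”, “value” and “wage” than the others, so it consists of the strong, well-known players. In the centroid table, this is the row with an overall of about 71 and a value_eur of about 6.0 million, against about 63 and 1.1 million for the other two. The main difference between the two other clusters is height and weight: the second cluster (red) is taller and heavier than the first cluster (green), at about 183 cm and 76 kg against about 179 cm and 73 kg. We can therefore guess that the bottom-to-top vector (pca_2) conveys a notion of build or, perhaps, of position. Note that cluster numbers are arbitrary labels, so the numbering used in the plot need not match the row order of the centroid table.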

Finally, let’s have a look at the loadings of the principal components on the original features.

##                                      PC1          PC2
## overall                     -0.140523693  0.086703572
## potential                   -0.090435065  0.035016238
## value_eur                   -0.084353301  0.036262081
## age                         -0.061356100  0.069723915
## height_cm                    0.042433553  0.077178168
## weight_kg                    0.021479732  0.077799757
## league_level                 0.025426437 -0.012857498
## club_jersey_number           0.018948503 -0.040080451
## weak_foot                   -0.058222787 -0.029819159
## skill_moves                 -0.116265304 -0.063005117
## international_reputation    -0.071856297  0.035910279
## pace                        -0.070280469 -0.091149886
## shooting                    -0.143777955 -0.104736618
## passing                     -0.164980344  0.028876908
## dribbling                   -0.165759524 -0.045548935
## defending                   -0.009205539  0.230880807
## physic                      -0.040105643  0.155624325
## attacking_crossing          -0.133899240  0.002892932
## attacking_finishing         -0.128664075 -0.128067284
## attacking_heading_accuracy  -0.025076435  0.101686551
## attacking_short_passing     -0.148360162  0.069723984
## attacking_volleys           -0.128600098 -0.098078062
## skill_dribbling             -0.155804844 -0.063167006
## skill_curve                 -0.143659774 -0.039895559
## skill_fk_accuracy           -0.126509388 -0.024262092
## skill_long_passing          -0.125487893  0.096516197
## skill_ball_control          -0.162693501 -0.006901399
## movement_acceleration       -0.072708916 -0.095253248
## movement_sprint_speed       -0.064027132 -0.082190200
## movement_agility            -0.105614397 -0.084447787
## movement_reactions          -0.131043993  0.084629586
## movement_balance            -0.073426361 -0.073933592
## power_shot_power            -0.136489683 -0.043406020
## power_jumping               -0.010232627  0.079252791
## power_stamina               -0.084355115  0.091153270
## power_strength              -0.004567688  0.117585047
## power_long_shots            -0.144640948 -0.071802689
## mentality_aggression        -0.043592747  0.171457441
## mentality_interceptions     -0.014079449  0.223048042
## mentality_positioning       -0.143555102 -0.092398259
## mentality_vision            -0.154591563 -0.027775141
## mentality_penalties         -0.110843557 -0.091590110
## mentality_composure         -0.137894200  0.054452687
## defending_marking_awareness -0.010257415  0.221490101
## defending_standing_tackle   -0.003628388  0.223396363
## defending_sliding_tackle     0.003188535  0.220795425
## ls                          -0.162430468 -0.058564565
## st                          -0.162430468 -0.058564565
## rs                          -0.162430468 -0.058564565
## lw                          -0.169657617 -0.059577445
## lf                          -0.170018930 -0.059129286
## cf                          -0.170018930 -0.059129286
## rf                          -0.170018930 -0.059129286
## rw                          -0.169657617 -0.059577445
## lam                         -0.173197226 -0.039850438
## cam                         -0.173197226 -0.039850438
## ram                         -0.173197226 -0.039850438
## lm                          -0.173060439 -0.030320777
## lcm                         -0.167769487  0.060085213
## cm                          -0.167769487  0.060085213
## rcm                         -0.167769487  0.060085213
## rm                          -0.173060439 -0.030320777
## lwb                         -0.097181449  0.184855039
## ldm                         -0.079955298  0.208677637
## cdm                         -0.079955298  0.208677637
## rdm                         -0.079955298  0.208677637
## rwb                         -0.097181449  0.184855039
## lb                          -0.072772615  0.205754632
## lcb                         -0.030014201  0.231763787
## cb                          -0.030014201  0.231763787
## rcb                         -0.030014201  0.231763787
## rb                          -0.072772615  0.205754632

We can indeed conclude that:

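  • pca_1 gives large negative weights to the technical skills, to the attacking and midfield position ratings and, to a lesser extent, to the market value (value_eur)
  • pca_2 gives positive weights to height and weight, but its largest weights are on the defensive attributes (defending, tackling, interceptions) and on the defensive position ratings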
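To see how the loadings translate into a position on the plot, the sketch below computes the contribution of four features to a player's PC1 and PC2 coordinates. The loadings are copied from the table above; the standardized feature values (z-scores) of the two players are hypothetical, and it is an assumption that the features were standardized before the PCA. Since only four features are used, the results are partial scores, not the coordinates shown in the graph.

```python
# Loadings copied from the table above as (PC1, PC2), for four features only.
LOADINGS = {
    "overall": (-0.140523693, 0.086703572),
    "shooting": (-0.143777955, -0.104736618),
    "defending": (-0.009205539, 0.230880807),
    "height_cm": (0.042433553, 0.077178168),
}

def partial_scores(z_scores):
    """Sum of z-score times loading over the listed features, per component."""
    pc1 = sum(z * LOADINGS[name][0] for name, z in z_scores.items())
    pc2 = sum(z * LOADINGS[name][1] for name, z in z_scores.items())
    return pc1, pc2

# Hypothetical standardized profiles (assumed z-scores, not real players).
defender = {"overall": 1.0, "shooting": -1.0, "defending": 1.5, "height_cm": 1.0}
attacker = {"overall": 1.0, "shooting": 1.5, "defending": -1.5, "height_cm": -0.5}

defender_pc1, defender_pc2 = partial_scores(defender)
attacker_pc1, attacker_pc2 = partial_scores(attacker)
print("defender:", defender_pc1, defender_pc2)
print("attacker:", attacker_pc1, attacker_pc2)
```

With these inputs the tall, defensive profile lands higher on pca_2 and the strong shooter lands further left on pca_1, which matches the reading of the two axes given above.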

This is confirmed by quickly hovering over the graph. The leftmost players (Mbappe, Messi, De Bruyne, Neymar) are known to be very good and very well paid. At the top we have Ruben Dias, Laporte and Van Dijk, who are known to be very tall and powerful, while at the bottom we have Muriel, Insigne and Coman, who are short and fast.

4.5 Where are the best players from?

4.5.1 All the players

Where is football a popular sport? Europe and South America for sure, but FIFA includes players from all around the world. This map shows where they are from.

We can see that Brazil, Argentina, Spain, France and Germany are the most represented countries, with around 1000 players each. Even though football is not the national sport in either country, there are about 400 American and Chinese players in the game! Conversely, football is very popular in Africa, but very few African players are represented. This is probably because the African leagues are not yet in the game.

4.5.2 Top 1000 players

It may be interesting to see whether the countries with the most players are also the countries with the best players. For this purpose, we plot the same chart, but with only the 1000 players with the highest overall rating.

We can see that Spain is by far the best-represented country in the top 1000 in 2022, followed by Brazil, Argentina and France. It is interesting to highlight that although Germany has 4 times as many players in FIFA as Italy (1200 versus 300), both countries have approximately 50 players in the top 1000, that is, roughly 4% of the German players against roughly 17% of the Italian players. The same holds for the USA versus Canada.
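The comparison between the number of players a country has in the game and the number it has among the best can be sketched as follows. The field names nationality_name and overall are assumptions about the data layout, and the five players are invented for illustration.

```python
from collections import Counter

def top_share(players, top_n):
    """Per country: (players in the game, players in the top_n, share in top_n)."""
    total = Counter(p["nationality_name"] for p in players)
    ranked = sorted(players, key=lambda p: p["overall"], reverse=True)
    top = Counter(p["nationality_name"] for p in ranked[:top_n])
    return {
        country: (total[country], top[country], top[country] / total[country])
        for country in total
    }

# Hypothetical players (not real data).
players = [
    {"nationality_name": "Germany", "overall": 85},
    {"nationality_name": "Germany", "overall": 70},
    {"nationality_name": "Germany", "overall": 65},
    {"nationality_name": "Germany", "overall": 60},
    {"nationality_name": "Italy", "overall": 84},
]

shares = top_share(players, top_n=2)
for country, (in_game, in_top, share) in shares.items():
    print(f"{country}: {in_top} of {in_game} players in the top 2 ({share:.0%})")
```

Looking at the share rather than the raw count separates how many players a country has in the game from how strong those players are.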